Exploratory data analysis

Data import


Split the data into train_df (80%) and test_df (20%).
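
The original splitting code is not shown, so the following is only a minimal sketch of one way to produce an 80/20 split with pandas. The helper name `split_train_test`, the seed, and the toy data are assumptions for illustration; only `train_df`, `test_df`, and the column names come from this notebook.

```python
import pandas as pd


def split_train_test(df: pd.DataFrame, train_frac: float = 0.8, seed: int = 42):
    """Randomly split df into a train part and a test part.

    Hypothetical helper: assumes df has a unique index, so that dropping
    the sampled index labels leaves exactly the remaining rows.
    """
    train_df = df.sample(frac=train_frac, random_state=seed)
    test_df = df.drop(train_df.index)
    return train_df, test_df


# Toy data using the column names from this notebook (values are made up).
toy_df = pd.DataFrame({
    "Id": range(10),
    "Text": [f"review {i}" for i in range(10)],
    "Author": ["a", "b"] * 5,
    "Rating": [i / 10 for i in range(10)],
})
train_df, test_df = split_train_test(toy_df)
print(len(train_df), len(test_df))
```

Fixing the seed keeps the split reproducible between runs. If the rating imbalance matters for evaluation, a stratified split would be an alternative worth considering.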

Data overview


Data Format

Column Name   Column Type   Description
Id            Numeric       Unique ID assigned to each observation.
Text          Free text     Body of the review.
Author        Categorical   Name of the review's author.
Rating        Numeric       Rating given with the review.

Identify features to drop


Visualizing features


The distribution of ratings

Most ratings fall between 0.4 and 0.9, so the training dataset is somewhat imbalanced.
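
To put numbers on this kind of imbalance, the ratings can be counted per bin. The sketch below assumes that ratings lie in the range 0 to 1 and uses bins of width 0.1. The function name `rating_distribution` and the sample values are hypothetical.

```python
import pandas as pd


def rating_distribution(ratings: pd.Series, width: float = 0.1) -> pd.Series:
    """Count ratings per bin of the given width (assumes ratings in [0, 1])."""
    n_bins = round(1 / width)
    # Round the edges to avoid floating-point noise such as 0.30000000000000004.
    edges = [round(i * width, 10) for i in range(n_bins + 1)]
    binned = pd.cut(ratings, bins=edges, include_lowest=True)
    # sort=False keeps the bins in ascending order instead of ordering by count.
    return binned.value_counts(sort=False)


# Made-up ratings, for illustration only.
sample_ratings = pd.Series([0.45, 0.47, 0.55, 0.85, 0.95])
dist = rating_distribution(sample_ratings)
print(dist)
```

Dividing the counts by `len(sample_ratings)` would give the share of each bin, which is easier to compare across datasets of different sizes.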

The 20 most frequent words in the text

The table above shows that the 20 most frequent words carry no useful information for prediction. When we build our model, we need to ignore these words.
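
As a rough illustration of this step, the sketch below counts word frequencies with the standard library and then filters out the most frequent words. The tokenization rule (lowercase runs of letters and apostrophes), the function names, and the sample texts are assumptions, not the notebook's actual method.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"[a-z']+")  # assumed tokenization rule


def top_words(texts, n=20):
    """Return the n most frequent lowercase tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_PATTERN.findall(text.lower()))
    return counts.most_common(n)


def remove_words(text, ignored):
    """Return text with every token found in the ignored set removed."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return " ".join(t for t in tokens if t not in ignored)


# Made-up reviews, for illustration only.
sample_texts = ["The movie was the best", "The plot was thin"]
frequent = top_words(sample_texts, n=2)
ignored = {word for word, _ in frequent}
cleaned = remove_words("The plot was thin", ignored)
print(frequent, cleaned)
```

In a real pipeline the ignored set would be built from the training texts only, so that the test set does not influence preprocessing.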

The relationship between text length and ratings

The graph shows no clear relationship between text length and rating. The average length is roughly 4,000 to 5,000 for every rating range except 0.9-1.0, where it exceeds 5,500.
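
The per-range averages behind such a graph could be computed along the following lines. This is a sketch under stated assumptions: length is measured in characters, ratings lie between 0 and 1, and the bins have width 0.1. The function name `mean_length_by_rating` and the sample rows are hypothetical.

```python
import pandas as pd


def mean_length_by_rating(df: pd.DataFrame, width: float = 0.1) -> pd.Series:
    """Average character length of Text per rating bin (ratings assumed in [0, 1])."""
    lengths = df["Text"].str.len()
    n_bins = round(1 / width)
    # Round the edges to avoid floating-point noise in the bin boundaries.
    edges = [round(i * width, 10) for i in range(n_bins + 1)]
    bins = pd.cut(df["Rating"], bins=edges, include_lowest=True)
    # observed=True skips rating bins that contain no reviews.
    return lengths.groupby(bins, observed=True).mean()


# Made-up rows, for illustration only.
sample_df = pd.DataFrame({
    "Text": ["aa", "aaaa", "bbbbbb"],
    "Rating": [0.45, 0.47, 0.95],
})
mean_lengths = mean_length_by_rating(sample_df)
print(mean_lengths)
```

A single summary number, such as the correlation between length and rating, could complement the per-bin averages when judging whether length is worth keeping as a feature.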